Papers related to "High Performance Computing"

Abstract: Current coal mine monitoring data acquisition systems suffer from inconsistent data coding standards, poor reliability when run as a single system, data visualization tools that do not support Web access, and hosts that cannot be switched over intelligently. To address these problems, a microservice-oriented coal mine production monitoring data acquisition system is designed on the basis of the industrial internet platform technology system, using front-end/back-end separation and RESTful API access and integrating open-source reporting and other components. The paper reviews the problems of existing systems, presents the architecture of the microservice-oriented system, and details its key technologies: unified data standardization, microservice-based processing of single-channel acquisition, data service status monitoring, and data and information visualization components. The system has been verified at the Shanxi Tiandi Wangpo Coal Mine. The application results show that, with a data model designed to unified standard specifications, deploying and running the data acquisition, data service, visualization and other application microservices achieves acquisition and monitoring of coal mine production data and provides essential data support for other applications in the mine. The microservice operating model also improves the efficiency of the mine's system operation and maintenance personnel.

Abstract: To improve accuracy, image semantic segmentation networks often use complex convolutional layers as their basic feature extraction units. The different types of convolution in these layers make parallel acceleration of the network more difficult. An FPGA-based parallel computing accelerator for multiple convolution types is proposed to meet the acceleration requirements of the different convolution types in semantic segmentation networks. First, the computational principle of convolution is analyzed. Then, based on the basic operations of the different convolution types, a processing unit that performs multiple multiplications in parallel is constructed. Convolution is accelerated through parallelism across multiple processing units, data reuse, and pipelining. The experimental results show that, for feature maps of specific sizes, the proposed accelerator design achieves a maximum speedup of 113 times.

Abstract: This paper proposes a general FPGA-based CNN hardware accelerator design. For the convolutional layer, which is the most computationally intensive, three acceleration modes are adopted: input channel parallelism, intra-kernel parallelism, and output channel parallelism, with the degree of each set according to the on-chip resources of the FPGA. For data loading, adjacent data words are combined into wider transfers, which effectively improves the actual transmission bandwidth of the accelerator. The input cache module is designed around row-based data flow loading: it only needs to cache two rows of data before the convolution operation can start, which brings the start of the convolution forward. The data input, data operation, and data output modules are pipelined to greatly improve the computing performance of the hardware. Finally, the accelerator is applied to the VGG16 and Darknet-19 networks. Experiments show that the computing performance reaches 34.30 GOPS and 33.68 GOPS, respectively, with DSP computing efficiency of 79.45% and 78.01%.
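
The row-based input cache described above can be illustrated in software. The following is a minimal Python sketch, not the paper's hardware design: it assumes a 3×3 kernel, stride 1, no padding, and a row-major pixel stream, and the function name `stream_conv3x3` is hypothetical. Outputs start as soon as two complete rows are buffered and the third pixel of the next row arrives.

```python
def stream_conv3x3(pixels, width, kernel):
    """Yield 3x3 sliding-window outputs for a row-major pixel stream.

    Only the two most recent complete rows are buffered, plus the row
    that is currently arriving (an assumption for this sketch).
    The kernel is applied without flipping, as is usual in CNNs.
    """
    row_a, row_b = [], []   # the two buffered rows (older, newer)
    current = []            # pixels of the row being received
    for p in pixels:
        current.append(p)
        col = len(current) - 1
        # A window is complete once two full rows are cached and
        # at least three pixels of the current row have arrived.
        if len(row_a) == width and len(row_b) == width and col >= 2:
            window = (row_a[col - 2:col + 1],
                      row_b[col - 2:col + 1],
                      current[col - 2:col + 1])
            yield sum(kernel[r][c] * window[r][c]
                      for r in range(3) for c in range(3))
        if len(current) == width:
            # Row finished: drop the oldest row, keep the newest two.
            row_a, row_b, current = row_b, current, []


ones = [[1, 1, 1]] * 3
print(list(stream_conv3x3(range(16), 4, ones)))  # 4x4 image with values 0..15
```

In hardware the two cached rows would typically be on-chip line buffers and the window a small shift register; the sketch only shows why two cached rows are enough to begin computing.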

Abstract: As neural networks grow deeper, sparse deep neural networks offer greater advantages in computation and storage space, but their performance still needs to be optimized. A performance optimization method for sparse deep neural networks on GPUs is therefore proposed. It adjusts the order of computation, increases data reuse, and, by combining the architectural features of the GPU with CUDA programming techniques, further improves performance through prefetching and other methods. On the official GraphChallenge data set, it achieves up to 2.5 times the performance of the corresponding cuSPARSE library functions.

Abstract: Autonomous driving simulation platforms are an effective way to address the long testing time, high cost, and difficulty of reproducing extreme scenarios in real-vehicle testing. With the introduction of large-scale cloud computing and high-performance heterogeneous computing, simulation platforms face the challenges of efficient virtualization, balanced scheduling, and convenient device-cloud interaction. This paper therefore designs a parallel system architecture that supports lightweight virtualization and integrates fine-grained balanced resource scheduling with low-latency remote interaction. On this basis it builds ADsim, a high-performance parallel simulation platform, which testing has verified to deliver high performance and convenient interaction.

Abstract: Molecular dynamics (MD) simulation is an important tool for exploring the microscopic world and is widely used in many fields. Two-dimensional materials are an important research direction for MD in materials science, and the calculation of interlayer interactions is the most time-consuming part of such simulations. High performance computing is a key technology for improving the simulation efficiency of two-dimensional materials. In this work, the computing power of the new generation Sunway supercomputer is used to improve the MD simulation efficiency of two-dimensional materials. For the interlayer force field, several algorithm optimization strategies are adopted, including eliminating redundant calculation, multi-core fusion, and buffering. Thread-level parallelism is implemented by accumulating forces in a software cache, accumulating energies through communication between computing processing elements, and using C++ features. A coordinated hardware and software cache policy is adopted to improve memory access efficiency. The experimental results show that overall performance is improved by 155 times, the simulation efficiency is about 2 ns/day, and the weak-scaling parallel efficiency reaches 92.7%.

Abstract: A digital decimation filter for a Sigma-Delta analog-to-digital converter is designed in a standard 0.18 μm process. Its decimation rate can be changed to adapt to different signal bandwidths. The filter uses multi-stage decimation and consists of a cascaded integrator-comb filter, a compensation filter, and a half-band filter. The decimation rate of the implemented filter can be set to 64, 128, 256, or 512, and compensation filters and half-band filters for different bandwidths are also designed. The filter area is 0.6 mm × 0.6 mm. At a working voltage of 1.98 V, the maximum total power consumption is about 2 mW, and the highest signal-to-noise ratio reaches 110.5 dB. When the passband frequencies of the compensation filter and the half-band filter are selected from the highest to the lowest bandwidth, power consumption is reduced by 61% and 53%, respectively. At the minimum filter power consumption of 69.63 μW, the bandwidth that can be processed is 390.6 Hz and the signal-to-noise ratio is 107.8 dB.
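
The first stage of the chain above, the cascaded integrator-comb (CIC) filter, needs only adders and registers, which is why it is commonly placed at the high-rate input. The following Python sketch is a generic behavioral model, not the paper's circuit: the number of stages, a differential delay of 1, and the function name `cic_decimate` are assumptions, and the abstract does not say how the overall decimation rate is split between the three filters.

```python
def cic_decimate(samples, rate, stages=3):
    """Behavioral model of an N-stage CIC decimator (differential delay 1).

    Integrators run at the input rate; combs run at the output rate.
    The DC gain is rate ** stages.
    """
    integrators = [0] * stages
    comb_delays = [0] * stages
    out = []
    for n, x in enumerate(samples):
        acc = x
        # Integrator chain: running sums at the input sample rate.
        for s in range(stages):
            integrators[s] += acc
            acc = integrators[s]
        # Keep every `rate`-th sample and pass it through the comb chain.
        if (n + 1) % rate == 0:
            for s in range(stages):
                previous = comb_delays[s]
                comb_delays[s] = acc
                acc -= previous
            out.append(acc)
    return out


# A constant input settles to the DC gain rate ** stages after a short transient.
print(cic_decimate([1] * 16, rate=4, stages=2))
```

The passband droop of the CIC response is what the compensation filter in such a chain is normally designed to correct.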

Abstract: High-performance platforms that support the access, management and operation of upper-layer applications, such as dynamic planning and operation optimization for digital twins of regional multi-energy systems, are currently lacking. Taking into account the communication efficiency and data security requirements of digital twin applications, this paper designs, with the help of the CloudPSS cloud simulation platform, the software and hardware of a secure, flexible and scalable digital twin application platform comprising a basic layer, an application layer and a business layer. With the Guizhou Hongfeng Lake regional multi-energy system as the pilot area, the paper describes the construction of the platform, including its modeling and simulation, data communication, and application integration functions. The resulting platform serves as a tool for taking digital twins from concept to application, and its architecture design and construction experience can provide theoretical and practical references for the wider adoption of digital twin technology.

Abstract: The high penetration of renewable energy in the power grid challenges power system operation and stability evaluation because of the fast dynamic behavior of power electronic devices. Accurate broadband impedance measurement is a critical part of power system stability analysis. Measuring impedance by injecting harmonics into the actual power system may introduce a risk of instability, whereas hardware-in-the-loop simulation can achieve accurate broadband impedance measurement at lower cost and risk. The computing accuracy and simulation scale of a real-time simulator depend on the computing power of the target computer. As the demand for large-scale power system simulation grows, a single real-time simulation system may not meet the requirements. Co-simulation between multiple real-time simulators is an effective way to increase the scale of the simulation. This paper proposes an online impedance analysis co-simulation framework based on the CloudPSS-RT and RT-Lab platforms and verifies the accuracy and effectiveness of the co-simulation by constructing a PV and DFIG power generation system.

Abstract: Redis is an in-memory unstructured database known for its high I/O (Input/Output) performance and fast response. It plays an important role in data buffering, message queues, key-value storage and other scenarios. Among the many clients it supports, the C/C++ client Hiredis is particularly widely used. This paper analyzes the Hiredis library in depth and finds that its pipeline function suffers from high overhead, improper instruction storage, and memory confusion problems. On this basis, the paper designs and implements a high-performance, high-availability Redis client for C/C++ on a Linux server with a 32-core x86 processor and 64 GB of memory. Through memory pre-allocation and memory isolation, the client improves performance when processing large numbers of instructions and solves the memory confusion problem in complex scenarios. Tests show that the new client improves instruction execution efficiency by 3 to 7 times while ensuring memory safety and accuracy in complex scenarios.
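
To make the idea of pipelining with pre-allocated memory concrete, the sketch below encodes commands in the Redis serialization protocol (RESP) and batches them into a buffer that is allocated once. It is a protocol-level illustration in Python only: it is neither the Hiredis API nor the client described in the abstract, and the names `encode_command` and `PipelineBuffer` as well as the fixed capacity are assumptions.

```python
def encode_command(*args):
    """Encode one command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


class PipelineBuffer:
    """Collects encoded commands in a buffer allocated once up front."""

    def __init__(self, capacity=4096):
        self._buf = bytearray(capacity)  # pre-allocated, never resized
        self._used = 0
        self.pending = 0                 # commands waiting to be sent

    def append(self, *args):
        encoded = encode_command(*args)
        end = self._used + len(encoded)
        if end > len(self._buf):
            raise BufferError("pipeline buffer full; flush first")
        self._buf[self._used:end] = encoded  # same-length slice write
        self._used = end
        self.pending += 1

    def flush(self):
        """Return the batched bytes for a single send and reset the buffer."""
        payload = bytes(self._buf[:self._used])
        self._used = 0
        self.pending = 0
        return payload


pipe = PipelineBuffer(capacity=64)
pipe.append("PING")
pipe.append("GET", "k")
print(pipe.flush())
```

Sending the whole payload in one write and then reading `pending` replies is what saves round trips; giving each pipeline its own buffer is one simple way to keep concurrent users from overwriting each other's commands.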

Abstract: The new generation Global/Regional Assimilation and PrEdiction System (GRAPES) is a homegrown numerical weather prediction software package developed by the China Meteorological Administration (CMA). As the requirements for model resolution and prediction timeliness increase, the Input/Output (I/O) performance of GRAPES has become a critical bottleneck. This paper analyzes the I/O behavior of the GRAPES regional model in depth, and proposes, designs and implements a high-performance I/O framework. The framework provides a flexible and configurable output method through binary encoding and multiple I/O channels. It also supports asynchronous I/O through non-blocking communication, which hides the I/O and communication overhead. The framework has been tested on the Sugon Pai supercomputer, and the results show that it not only improves I/O performance by more than ten times at best but also reduces the performance uncertainty caused by performance jitter.

Abstract: CWRF (Climate-Weather Research and Forecasting model) is the component of the regional climate prediction system built at the National Climate Center that consumes the largest proportion of time. High performance computing is a key technology for improving the computational performance of CWRF. Configuring and optimizing the CWRF model on the domestic Sunway many-core system to improve its simulation efficiency is of great significance for speeding up the model and for its continued development. This paper completes the configuration and performance evaluation of CWRF on the SW26010 many-core architecture. Memory access optimization, cache hit rate optimization, and many-core acceleration are introduced to speed up the dynamic-core, physical, and I/O processes of CWRF. The results show an average speedup of 2 times and a maximum of 6.4 times for the dynamic process, an average of 1.7 times and a maximum of 5.4 times for the physical process, 1.2 times for the I/O process, and 1.4 times for the overall program, with a reasonable calculation error.

Abstract: The Community Earth System Model (CESM) is a numerical model that quantitatively describes changes in the climate system. Because of its huge volume of scientific computation, it is one of the most important research objects in high-performance computing. Load imbalance among the meteorological sub-modules and components of CESM leaves its computing performance unsatisfactory. Finding the optimal layout manually by enumerating parameters is unrealistic, because the variety of available process layout schemes makes the search space huge. To solve this problem, this paper proposes and implements a search strategy for load-balancing optimization based on a matrix-nesting idea, which assists process layout and screens candidates according to the parallel requirements of the original model. Experiments show that the optimal layout found by this search strategy improves performance by 47.3% over the default layout and achieves a speedup ratio of 1.419 on 5 nodes.

Abstract: Traditional sorting methods, such as bubble sort and selection sort, are mainly implemented serially in software. These algorithms usually compare elements sequentially and have relatively high time complexity. In recent years, some sorting algorithms with a high degree of parallelism have been proposed, but the hardware characteristics of the CPU prevent their parallelism from being fully exploited. FPGAs offer good flexibility, parallelism and integration, so the advantages of these parallel algorithms can be better realized on an FPGA, greatly improving the real-time performance of data sorting. On this basis, the paper designs a CPU-FPGA heterogeneous system, ports several sorting algorithms to the FPGA, and performs functional verification and theoretical performance evaluation. The results show that the system accelerates highly parallel sorting algorithms well but consumes a large amount of logic resources, making it suitable for algorithm acceleration scenarios with strict real-time requirements.
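
The abstract does not name the sorting algorithms that were ported. As one illustrative example of a highly parallel sort, chosen here as an assumption, odd-even transposition sort runs in n phases, and every compare-exchange within a phase touches a disjoint pair of elements. On an FPGA each phase could therefore be a single row of comparators operating at once, while the Python sketch below simulates the phases sequentially.

```python
def odd_even_transposition_sort(values):
    """Sort by alternating compare-exchange phases on even and odd pairs.

    All pairs inside one phase are independent, so hardware can process
    a whole phase in parallel; n phases are enough for n elements.
    """
    data = list(values)
    n = len(data)
    for phase in range(n):
        start = phase % 2  # even phases: (0,1),(2,3)...; odd phases: (1,2),(3,4)...
        for i in range(start, n - 1, 2):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data


print(odd_even_transposition_sort([7, 3, 9, 1, 4, 8, 2]))
```

The comparator count grows with the number of elements, which is consistent with the trade-off noted above: strong acceleration at the cost of substantial logic resources.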

Abstract: The traditional CORDIC (Coordinate Rotation Digital Computer) algorithm requires many iterations, converges slowly, and consumes substantial resources when computing high-precision arctangents. An improved high-precision CORDIC algorithm is proposed. The method uses the traditional CORDIC algorithm to obtain sine information after several iterations, then uses the sine value to compensate the error of the iteration result, which effectively improves the calculation accuracy. Experimental data show that the 32-bit improved CORDIC algorithm keeps the absolute error below 5×10⁻⁹ while reducing lookup table resource consumption by 64.8%, flip-flop resource consumption by 35.3%, and output delay by 53.3%. In molecular dynamics application scenarios, flip-flop resource consumption can be reduced by 63.2% and output delay by 60%. The improved CORDIC algorithm outperforms the traditional CORDIC algorithm in resource consumption and output delay, and is suitable for high-precision computing applications.
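
The following Python sketch shows traditional vectoring-mode CORDIC for the arctangent, followed by one possible form of sine-based compensation. The abstract does not give the paper's compensation formula, so the correction used here (adding the sine of the residual angle, which is nearly equal to the small residual angle itself) is an assumption made for illustration, as are the function names. Floating-point arithmetic stands in for the fixed-point shifts and adds of a hardware implementation.

```python
import math


def cordic_atan(y, x, iterations):
    """Traditional vectoring-mode CORDIC; assumes x > 0.

    Rotates (x, y) toward the x-axis by +/- atan(2**-i) each step and
    returns the accumulated angle plus the residual vector.
    """
    z = 0.0
    for i in range(iterations):
        shift = 2.0 ** -i  # a right shift by i bits in hardware
        if y > 0:
            x, y, z = x + y * shift, y - x * shift, z + math.atan(shift)
        else:
            x, y, z = x - y * shift, y + x * shift, z - math.atan(shift)
    return z, x, y


def compensated_atan(y, x, iterations):
    """Add sin(residual angle) to the CORDIC result (assumed compensation)."""
    z, xr, yr = cordic_atan(y, x, iterations)
    return z + yr / math.hypot(xr, yr)  # sin(t) ~ t for a small residual t


exact = math.atan2(1.0, 2.0)
plain, _, _ = cordic_atan(1.0, 2.0, 8)
print(abs(plain - exact), abs(compensated_atan(1.0, 2.0, 8) - exact))
```

With 8 iterations the uncompensated error is bounded by atan(2⁻⁷), about 0.0078 rad, while the compensated error is bounded by roughly the cube of that residual divided by 6. This shows how a compensation step can replace many further iterations.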

Abstract: Digital-analog hybrid simulation is essential for understanding the real power grid and supporting power grid security. Complex power network topology and hard real-time simulation place high demands on computing performance. At present, digital-analog hybrid simulation mainly relies on parallel computing to improve computing performance. With the development of processor and cluster technology, heterogeneous cluster systems have gradually become the primary way of building high-performance computing systems. Existing power grid division methods cannot make full use of the computing power of such multi-level system architectures. The main challenges for the partitioning and mapping algorithm are the high latency of cross-layer communication and the unequal number of available processor cores on each computing node caused by heterogeneous acceleration equipment. For the electromagnetic transient simulation system ADPSS developed by the China Electric Power Research Institute, this paper designs a two-stage integrated optimization algorithm for power grid division and process mapping, which achieves better load balance, minimizes communication, and further reduces the calculation time of the electromagnetic transient simulation. The algorithm is based on min-cut partitioning and effectively solves the optimal mapping of sub-networks of unequal sizes onto heterogeneous cluster systems. Simulation tests were carried out on the real power grids of Northwest and East China. Compared with the default ADPSS partition and mapping algorithm, the proposed algorithm achieves average communication performance improvements of 40% and 50% and average overall computing performance improvements of 10% and 12%, respectively.